Abstract

Although semi-supervised learning has been an active area of research,
its use in deployed applications is still relatively rare because the
methods are often difficult to implement, fragile in tuning, or lacking
in scalability. This paper presents expectation regularization, a
semi-supervised learning method for exponential family parametric models
that augments the traditional conditional label-likelihood objective
function with an additional term that encourages model predictions on
unlabeled data to match certain expectations, such as label priors. The
method is extremely easy to implement, scales as well as logistic
regression, and can handle non-independent features. We present
experiments on five different data sets, showing accuracy improvements
over other semi-supervised methods.

1.  Introduction 

Research in semi-supervised learning has yielded many publications over
the past ten years, but there are surprisingly few cases of its use in
application-oriented research, where the emphasis is on solving a task,
not on exploring a new semi-supervised method. This may be partially due
to the natural time it takes for new machine learning ideas to propagate
to practitioners. We believe it is also due in large part to the
complexity and unreliability of many existing semi-supervised methods.

The goal of our work here is to propose a simple semi-supervised
learning method that consistently provides accuracy improvements, that
is robust across many problem domains without meta-parameter tuning, and
that scales to extremely large unlabeled data sets.

This paper presents expectation regularization (XR), a new method for
semi-supervised learning with exponential-family parametric models. Many
exponential-family models such as logistic regression and multi-class
maximum entropy classifiers are optimized by maximizing the conditional
log-likelihood of the true labels given the input features. XR augments
this objective function by adding a second term that encourages model
predictions on unlabeled data to match certain designer-provided
expectations. In particular, the XR term minimizes the KL-divergence
between feature/label expectations predicted by the model and
human-provided feature/label expectation priors.

In this paper we empirically explore one important special case termed
label regularization, in which the human provides a label prior
distribution, and the XR term encourages the optimization procedure to
find parameters that predict a similar label distribution on the
unlabeled examples. (Intuitively one can see that this prevents a
typical failure case of several alternative semi-supervised methods, in
which the learned model predicts the same label for almost all inputs.)
Appropriate label distributions are often easily provided by human prior
knowledge; alternatively they can be obtained from the limited labeled
data, from which they can be estimated far more accurately than sparse
input feature distributions. We show below that XR is surprisingly
robust to inaccuracies in the provided label distribution prior.

Expectation regularization offers a number of practical advantages over
previous semi-supervised learning methods. It is simple to implement and
to use, requiring no pre-clustering of unlabeled data, no inverted index
for graph construction, no “auxiliary functions” and no “contrastive”
examples. It has two meta-parameter terms, both of which require little
or no tuning and are not overly sensitive. It is purely conditional on
inputs, and thus can robustly handle arbitrarily overlapping,
non-independent feature sets. It is a parametric model, and thus it can
be applied quickly to new instances without requiring the storage of
large quantities of the labeled and unlabeled training data. Not only
can XR perform well with many labeled examples, but unlike other methods
it also excels at very small levels of labeled data (as few as one
example per class). Significantly, it scales up to vast numbers of
unlabeled points (easily millions). It is quite robust; in our
experiments it provided consistent accuracy improvements.

Report Documentation Page (Standard Form 298): report date June 2007;
title “Simple, Robust, Scalable Semi-supervised Learning via Expectation
Regularization”; performing organization University of Massachusetts,
Department of Computer Science, Amherst, MA 01003; approved for public
release, distribution unlimited; 8 pages.

We present experimental results on five different data sets, and compare
against seven different alternative supervised and semi-supervised
methods. Across the data sets XR outperforms naive Bayes, SVMs, EM,
maximum entropy, entropy regularization (serving also as a stand-in for
transductive SVMs), cluster kernels, as well as a graph-based method.
The only cases in which XR underperforms an existing method are (a) a
radial-basis-function SVM when large amounts of labeled data are
available, and (b) naive Bayes EM on a simple, extremely sparse data
set, where naive Bayes outperforms maximum entropy. We also demonstrate
robustness to error in prior estimation and across meta-parameter
settings.

In future work we will experiment with expectations on features other
than labels, and will also apply these methods to structured models,
such as conditional random fields (Lafferty et al., 2001; Sutton &
McCallum, 2006), which are a natural fit for XR.

2.  Related  Work 

There have been many different approaches to semi-supervised learning
over the past decade that have shown various accuracy improvements. Here
we discuss some of the most popular methods: generative models with EM,
other “cluster-based” methods, auxiliary-function methods, and
graph-based methods.

Generative models trained by expectation maximization (Dempster et al.,
1977) have had a long history in semi-supervised machine learning. Nigam
et al. (1998) present a semi-supervised naive Bayes model for text
classification, and this method has also been applied to structured
classification problems such as part-of-speech tagging (Klein & Manning,
2004). However, while EM sometimes works very well, it can be fragile,
finding solutions that are worse than the equivalent supervised model.
Cozman and Cohen (2006) discuss the risks of using EM and describe
situations where it can fail.

Other “cluster-based” methods are discriminative, directly aiming to
place the decision boundary in low-density regions. For example,
transductive support vector machines (TSVMs) (Joachims, 1999) explicitly
model the distance between classes by simultaneously searching over
labelings of unlabeled/test instances and margins between regions of
similarly-labeled instances. This search can be expensive, and TSVMs
have difficulty handling large numbers of unlabeled instances, with
running time $O(n^3)$ as originally described, although Sindhwani and
Keerthi (2006) propose a method for speeding up training in some cases.
Furthermore, in our experience, TSVMs require extensive and delicate
tuning of meta-parameters. We note that Sindhwani and Keerthi report
results with meta-parameters tuned on test data.

Another cluster-based method with significantly faster training times is
entropy regularization (Grandvalet & Bengio, 2004). Here a traditional
conditional label likelihood objective function is augmented with a
second term that minimizes the entropy of the label distribution
predicted on unlabeled data. Chapelle et al. (2006) give empirical
evidence that entropy minimization performs as well as (if not better
than) TSVMs when the SVM is given a linear kernel. However, entropy
regularization also requires extremely sensitive tuning of the relative
weight between the two terms. Furthermore, when faced with small amounts
of labeled data and vast amounts of unlabeled data, entropy minimization
is unstable, preferring solutions where all points are assigned the same
label. (We note that our label regularization can easily be combined
with entropy regularization to avoid this problem.) Another fast
cluster-based method is information regularization (Corduneanu &
Jaakkola, 2003), which measures distance via the mutual information
between a classifier and the marginal distribution $p(x)$. In general,
if the cluster assumption is violated (i.e., the classes are not widely
separable), assigning decision boundaries to low-density regions is a
poor choice.
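The instability of entropy minimization noted above can be illustrated with a minimal Python sketch of the entropy term on its own. The function name and the toy predictions are illustrative assumptions, not taken from any published implementation; the sketch only shows that a model assigning every point the same label reaches the same minimum entropy as a confident model whose labels are balanced, so the entropy term alone cannot tell them apart.

```python
import math

def mean_prediction_entropy(predictions):
    """Average Shannon entropy (in nats) of per-instance label distributions."""
    total = 0.0
    for dist in predictions:
        total -= sum(q * math.log(q) for q in dist if q > 0.0)
    return total / len(predictions)

# Hypothetical predictions on four unlabeled points with two labels.
uncertain = [[0.5, 0.5]] * 4
balanced = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
degenerate = [[1.0, 0.0]] * 4  # every point receives the same label

for name, preds in [("uncertain", uncertain),
                    ("balanced", balanced),
                    ("degenerate", degenerate)]:
    print(name, mean_prediction_entropy(preds))
```

A term that compares the average predicted label distribution with a label prior separates the last two cases, which is the role label regularization plays when combined with entropy regularization.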

Instead of using data clustering directly to position the decision
boundary, other methods pre-cluster unlabeled data, and use these
clusters as features for supervised training on the labeled data (Miller
et al., 2004; Li & McCallum, 2005). These methods can work well when
natural unsupervised clusterings are correlated with the supervised
task, and when the amount of labeled data is not too small.
Auxiliary-task methods (Ando & Zhang, 2005) embed the cluster-discovery
into supervised training; contrastive methods (Smith & Eisner, 2005)
perturb the input space. Although these methods have been demonstrated
to produce impressive gains, both are quite sensitive to the selection
of auxiliary information, and making good selections requires
significant insight.1

Graph-based methods, also known as manifold methods, have been widely
applied to semi-supervised learning, and can be highly accurate. Here a
graph (typically with weighted edges) is formed over the labeled and
unlabeled points, and points are assigned labels based on the labels of
their neighbors. Zhu and Ghahramani (2002) propose label propagation,
where labels propagate from labeled instances to unlabeled instances.
Szummer and Jaakkola (2002) present a closely related approach which
uses random walks through the graph to assign labels. Li and McCallum
(2004) examine simultaneous pair-wise distance and classification
boundaries, which produces an implicit clustering over points. However,
like TSVMs, graph-based methods are slow, requiring time $O(n^3)$ or
$O(kn^2)$, where $k$ is the number of neighbors. They also are not
compact parametric models: they require that labeled and unlabeled data
be stored and used to classify new instances. Sub-sampling unlabeled
data can reduce runtime from $O(n^3)$ to $O(m^2 n)$ (Delalleau et al.,
2006), but subsampling does not take full advantage of available
unlabeled data. Other techniques for speeding up training can reduce the
time complexity to $O(m^3)$, $m < n$, but may reduce performance (Zhu &
Lafferty, 2005). In this paper we compare against a representative
graph-based label propagation method called Quadratic Cost Criterion
(QC) (Bengio et al., 2006) whose results are reported in Chapelle et al.
(2006).

Some semi-supervised learning methods other than our expectation
regularization have also used label prior distributions, but in quite
different ways. For example, class mean normalization (CMN) (Zhu et al.,
2003) employs class priors as a post-processing step to set thresholds
on the propagation of a label. Conditional harmonic mixing (Burges &
Platt, 2006) is another graph-based method that minimizes over each
point the KL-divergence between the currently predicted label
distribution and the distribution predicted by its neighbors. Schapire
et al. (2002) use a human-generated prior on model parameters and
minimize the per-instance KL-divergence between the label distribution
predicted by the prior model and that predicted by the learned model.
Schuurmans (1997) uses predicted label distributions on unlabeled data
for model structure selection (as opposed to parameter estimation).

1 Personal communication, F. Pereira

There are, of course, cases of semi-supervised learning being used in
application settings, although often with various difficulties. For
example, Macskassy and Provost (2006) apply harmonic mixing to
classification in relational data, but complain about running time and
prefer a simpler method. Niu et al. (2005) apply label propagation to
word sense disambiguation, and show that performance is sensitive to the
choice of metric for constructing the graph. Merialdo (1994), in a now
famous negative result, attempts semi-supervised learning to improve HMM
part-of-speech tagging and finds that EM with unlabeled data reduces
accuracy. Klein and Manning (2004) show that with very clever
initialization, however, EM can help. Kockelkorn et al. (2003) use
transductive SVMs for text classification, but complain that they are
computationally costly.

3.  Expectation  Regularization 

Many of the methods discussed above use knowledge of the marginal $p(x)$
either explicitly (Corduneanu & Jaakkola, 2003) or implicitly
(Grandvalet & Bengio, 2004) in deciding where to place decision
boundaries. Given knowledge of the marginal, these methods formulate
regularization criteria which favor decision boundaries that are placed
in areas of low density.

Expectation regularization uses an additional source of knowledge:
beliefs about the conditional probabilities of labels given features,
$p(y \mid x_j)$. These expectations can be obtained through various
means, either from estimation on labeled data or through human prior
knowledge. This type of information constitutes a new modality of
supervision, where instead of labeled examples, the user provides
beliefs about selected conditional probabilities.

Domain knowledge can be supplied to the classifier in a flexible way
using expectation regularization. In many domains, class priors, $p(y)$,
are a valuable source of information that is often approximately known
to the classifier designer. For example, in university web page
classification, one might estimate that roughly 60% of the personal home
pages belong to students. In other cases, we may have expectations about
the relationships between features and labels. For example, in
named-entity recognition, we may estimate that in newswire text 50% of
capitalized words are named entities. In gene name tagging, there may be
a 75% probability that a word is a gene if it ends with the morpheme
“gene.” Classifier designers traditionally employ features that they
know are correlated with labels. With expectation regularization the
classifier designers can also supply estimated feature/label
expectations. (Experimental results below show that our method is
surprisingly robust to a wide range of errors in these estimates.)

Given these expectations, we introduce a regularizer that penalizes
classifiers whose conditional probabilities $p_\theta(y \mid x_j)$ on
unlabeled data deviate from the human-provided expectations $\tilde{p}$.

Consider a set of unlabeled data $U = \{u_1 \ldots u_n\}$, where each
data instance $u$ comprises a feature vector
$x^{(u)} = (x^{(u)}_1 \ldots x^{(u)}_k)$. Since we do not have access to
the complete marginal $p(x)$, we use the unlabeled empirical
distribution $\tilde{p}(x)$ to compute the conditional probabilities
$\hat{p}_\theta(y \mid x_j = 1)$:

$$\hat{p}_\theta = \hat{p}_\theta(y \mid x_j = 1)
  = \sum_{x_{-j}} \tilde{p}(x_{-j} \mid x_j = 1)\, p_\theta(y \mid x_j = 1, x_{-j})
  = \frac{1}{|U_j|} \sum_{x \in U_j} p_\theta(y \mid x),$$

where $U_j$ is defined to be $\{x \in U : x_j = 1\}$. Here, the notation
$x_{-j}$ is used to indicate $\{x \setminus x_j\}$ (all features apart
from $x_j$). The expectation regularization term added to the objective
function is

$$\Delta(\tilde{p}, \hat{p}_\theta),$$

where $\tilde{p}$ is the human-provided conditional probability,
$\hat{p}_\theta$ is the model's expected conditional probability, and
$\Delta$ is a distance metric. In this paper, we explore one particular
choice of distance metric: KL-divergence. This choice of $\Delta$ is
equivalent to augmenting the likelihood with a Dirichlet prior over
expectations where values for the priors $\alpha$ are proportional to
$\tilde{p}$. KL-divergence can be factored into two parts:

$$\Delta(\tilde{p}, \hat{p}_\theta) = D(\tilde{p} \,\|\, \hat{p}_\theta)
  = \sum_y \tilde{p} \log \frac{\tilde{p}}{\hat{p}_\theta}
  = -\sum_y \tilde{p} \log \hat{p}_\theta + \sum_y \tilde{p} \log \tilde{p}
  = H(\tilde{p}, \hat{p}_\theta) - H(\tilde{p}).$$

Since $H(\tilde{p})$ is constant with respect to the model parameters,
minimizing the KL-divergence can also be seen as minimizing the cross
entropy of a hypothesized distribution and the expected distribution on
the unlabeled data, $H(\tilde{p}, \hat{p}_\theta)$. Note that this is
distinct from the traditional log-likelihood. The log-likelihood is
equivalent to the cross entropy over instances where for each instance
only the correct label has non-zero probability. In this regularization
term, $\tilde{p}$ and $\hat{p}_\theta$ are the expected distributions
averaged over all instances.

We apply expectation regularization to conditionally trained log-linear
maximum entropy models, which are also known as multinomial logistic
regression models. In these models, the probability of the class label
$y$ for a data instance $x$ is calculated by

$$p_\theta(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_k \theta_k x_k\right),$$

where $Z(x) = \sum_y \exp\left(\sum_k \theta_k x_k\right)$ is the
partition function. Given training data $D = \{d_1 \ldots d_n\}$, the
model is trained by maximizing the log-likelihood of the labels

$$\mathcal{L}(\theta; D) = \sum_d \log p_\theta(y^{(d)} \mid x^{(d)}).$$

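This can be done by gradient methods (Malouf, 2002), where the gradient
of the likelihood is

$$\frac{\partial}{\partial\theta_k} \mathcal{L}(\theta; D)
  = \sum_d x_k^{(d)} - \sum_d \sum_y p_\theta(y \mid x^{(d)})\, x_k^{(d)}.$$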
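The supervised part of the model can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the weights are stored per label as `theta[y][k]` (one common way to make the label dependence of the log-linear model explicit), and the function names are hypothetical. With that parameterization the gradient entry for label `y` and feature `k` is the observed feature count for that label minus its expected count under the model.

```python
import math

def predict(theta, x):
    """p_theta(y|x) for a log-linear model with per-label weights theta[y][k]."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in theta]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)      # partition function Z(x), up to the stability shift
    return [e / z for e in exps]

def log_likelihood(theta, data):
    """Sum over labeled instances (x, y) of log p_theta(y|x)."""
    return sum(math.log(predict(theta, x)[y]) for x, y in data)

def log_likelihood_gradient(theta, data):
    """Observed minus expected feature counts, one entry per (label, feature)."""
    grad = [[0.0] * len(theta[0]) for _ in theta]
    for x, y in data:
        probs = predict(theta, x)
        for label in range(len(theta)):
            coeff = (1.0 if label == y else 0.0) - probs[label]
            for k, v in enumerate(x):
                grad[label][k] += coeff * v
    return grad

# Hypothetical toy problem: two labels, three binary features.
theta = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
data = [([1.0, 0.0, 1.0], 0), ([1.0, 1.0, 0.0], 1), ([1.0, 1.0, 1.0], 0)]
print(log_likelihood(theta, data))
print(log_likelihood_gradient(theta, data))
```

Any gradient-based optimizer can then be run on `log_likelihood` using `log_likelihood_gradient`; the choice of optimizer is left open here.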

For semi-supervised discriminative training, we augment the objective
function by adding regularization terms on the unannotated data. (Here
the Gaussian prior is also shown.)

$$\mathcal{L}(\theta; D, U) = \sum_d \log p_\theta(y^{(d)} \mid x^{(d)})
  - \frac{\sum_k \theta_k^2}{2\sigma^2}
  - \lambda \Delta(\tilde{p}, \hat{p}_\theta).$$

In practice, we find that $\lambda$ does not need tuning for each data
set. We set it simply to $\lambda = 10 \times$ # labeled examples.
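To make the pieces of the combined objective concrete, here is a hedged Python sketch that evaluates it for label regularization. The per-label weight layout `theta[y][k]`, the function names, and the default Gaussian variance `sigma2=1.0` are assumptions introduced for the example (the text does not state a variance); the default weight on the regularizer follows the rule of ten times the number of labeled examples.

```python
import math

def predict(theta, x):
    """p_theta(y|x) for a log-linear model with per-label weights theta[y][k]."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in theta]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def xr_objective(theta, labeled, unlabeled, prior, sigma2=1.0, lam=None):
    """Log-likelihood minus Gaussian prior penalty minus lam * KL(prior || expectation)."""
    if lam is None:
        lam = 10.0 * len(labeled)  # weight rule stated in the text
    log_lik = sum(math.log(predict(theta, x)[y]) for x, y in labeled)
    gaussian = sum(w * w for row in theta for w in row) / (2.0 * sigma2)
    preds = [predict(theta, x) for x in unlabeled]
    p_hat = [sum(p[y] for p in preds) / len(preds) for y in range(len(prior))]
    kl = sum(pt * math.log(pt / q) for pt, q in zip(prior, p_hat) if pt > 0.0)
    return log_lik - gaussian - lam * kl

# Hypothetical toy data: two labeled and three unlabeled instances.
theta = [[0.0, 0.0], [0.0, 0.0]]
labeled = [([1.0, 0.0], 0), ([1.0, 1.0], 1)]
unlabeled = [[1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
print(xr_objective(theta, labeled, unlabeled, prior=[0.5, 0.5]))
print(xr_objective(theta, labeled, unlabeled, prior=[0.9, 0.1]))
```

With all-zero weights the model predicts a uniform distribution everywhere, so the penalty vanishes for a uniform prior and is positive for a skewed one.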

As an important special case of expectation regularization, we examine
label regularization, in which the features in question are the “default
features,” where $\forall x : x_j = 1$. In this case, the goal of the
regularizer is to match the prior distribution on labels. Note that this
useful special case is not available to Schapire et al. (2002) because
their regularizer is local (per-instance), whereas expectation
regularization is a global regularizer. If the model exactly matched the
label expectation on a per-instance basis, in application it would
assign all instances to the majority class.

3.1.  Expectation  Regularization  Gradient 

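This section presents the gradient for KL-divergence based expectation
regularization. First, we define the unnormalized potential

$$\hat{q}_\theta = \hat{q}_\theta(y \mid x_j = 1) = \sum_{x \in U_j} p_\theta(y \mid x).$$

Since $\hat{p}_\theta = \hat{q}_\theta / |U_j|$, the two differ only by
a constant factor. After dropping terms in
$\frac{\partial}{\partial\theta_k} D(\tilde{p} \,\|\, \hat{p}_\theta)$
which are constant with respect to the partial derivative, we are left
with

$$\begin{aligned}
\frac{\partial}{\partial\theta_k} \sum_y \tilde{p} \log \hat{q}_\theta
  &= \sum_y \frac{\tilde{p}}{\hat{q}_\theta} \sum_{x \in U_j} \frac{\partial}{\partial\theta_k} p_\theta(y \mid x) \\
  &= \sum_{x \in U_j} \sum_y p_\theta(y \mid x)\, x_k\, \frac{\tilde{p}}{\hat{q}_\theta}
     - \sum_{x \in U_j} \sum_{y'} p_\theta(y' \mid x)\, x_k \sum_y \frac{\tilde{p} \times p_\theta(y \mid x)}{\hat{q}_\theta} \\
  &= \sum_{x \in U_j} \sum_y p_\theta(y \mid x)\, x_k
     \times \left( \frac{\tilde{p}}{\hat{q}_\theta} - \sum_{y'} \frac{\tilde{p} \times p_\theta(y' \mid x)}{\hat{q}_\theta} \right),
\end{aligned}$$

where, inside each sum over labels, $\tilde{p}$ and $\hat{q}_\theta$ are
evaluated at the label of that sum.

When $\tilde{p} \propto \hat{q}_\theta$ (the expected unlabeled
distribution matches the labeled distribution) the gradient is 0. This
conforms to the intuition behind the development of the regularizer.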
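The gradient of the label-regularization term can be checked numerically with a short Python sketch. As before, this is an illustration under assumptions, not the authors' code: weights are stored per label as `theta[y][k]`, the function names are hypothetical, and the toy data are invented. The function `xr_term` computes the parameter-dependent part of the negative KL-divergence (the prior-weighted log of the summed predictions), and `xr_gradient` implements its derivative in the factored form, which can be compared against finite differences.

```python
import math

def predict(theta, x):
    """p_theta(y|x) for a log-linear model with per-label weights theta[y][k]."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in theta]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def summed_predictions(theta, unlabeled, num_labels):
    """Unnormalized potential: for each label, the sum of p_theta(y|x) over U."""
    preds = [predict(theta, x) for x in unlabeled]
    return preds, [sum(p[y] for p in preds) for y in range(num_labels)]

def xr_term(theta, unlabeled, prior):
    """Parameter-dependent part of -KL: sum_y prior[y] * log q_hat[y]."""
    _, q_hat = summed_predictions(theta, unlabeled, len(prior))
    return sum(pt * math.log(q) for pt, q in zip(prior, q_hat))

def xr_gradient(theta, unlabeled, prior):
    """Gradient of xr_term with respect to theta[y][k]."""
    num_labels = len(prior)
    preds, q_hat = summed_predictions(theta, unlabeled, num_labels)
    ratio = [prior[y] / q_hat[y] for y in range(num_labels)]
    grad = [[0.0] * len(theta[0]) for _ in theta]
    for x, p in zip(unlabeled, preds):
        # sum over y' of prior[y'] * p(y'|x) / q_hat[y']
        baseline = sum(ratio[y2] * p[y2] for y2 in range(num_labels))
        for y in range(num_labels):
            coeff = p[y] * (ratio[y] - baseline)
            for k, v in enumerate(x):
                grad[y][k] += coeff * v
    return grad

# Hypothetical toy problem: two labels, three features, four unlabeled points.
theta = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
unlabeled = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
print(xr_gradient(theta, unlabeled, prior=[0.7, 0.3]))
```

Setting the prior proportional to the summed predictions makes every entry of the gradient vanish, which matches the observation that the gradient is 0 when the expectations already agree.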


3.2.  Temperature 

Label regularization can occasionally find a degenerate solution where,
rather than the expectation over all instances matching the prior
distribution, the distribution over labels for each individual instance
matches the given distribution. For example, given a three-class
classification task, if the labeled class distribution is
$p(y) = \{.5, .35, .15\}$, it will find a solution such that
$p_\theta(y \mid x) = \{.5, .35, .15\}$ for every instance. As a result,
all the test instances will be assigned the same label.

One solution, appealing to 0/1 loss, would be to simply measure and
match the expectation over winning class counts, calculating
$\hat{p}_\theta$ as
$\frac{1}{|U_j|} \sum_{x \in U_j} \delta(y, \arg\max_{y'} p_\theta(y' \mid x))$.
However, this is not differentiable. So instead, we make
$p_\theta(y \mid x)$ more peaked using a temperature $T$ less than 1:

$$p_\theta(y \mid x) \propto \exp\left(\frac{1}{T} \sum_k \theta_k x_k\right).$$

This is differentiable and thus amenable to many gradient ascent
methods. In practice we find that this meta-parameter does not require
fine-tuning. Across all data sets we simply use $T = 0.1$ for
multi-class problems and $T = 1$ for binary classification problems, and
we find this to work well.
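The effect of the temperature on the degenerate solution can be seen in a small Python sketch. The setup is an assumption made for illustration: a model with a single bias feature whose per-label weights reproduce the three-class prior on every instance, with hypothetical function names. At a temperature of 1 the averaged predictions equal the prior, so the regularizer is satisfied; at 0.1 the same weights give a sharply peaked prediction whose average no longer matches the prior, so the degenerate solution is penalized.

```python
import math

def predict(theta, x, temperature=1.0):
    """Log-linear prediction with scores divided by the temperature."""
    scores = [sum(w * v for w, v in zip(row, x)) / temperature for row in theta]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def kl_to_expectation(prior, predictions):
    """KL(prior || average of the per-instance predictions)."""
    n = len(predictions)
    p_hat = [sum(d[y] for d in predictions) / n for y in range(len(prior))]
    return sum(p * math.log(p / q) for p, q in zip(prior, p_hat) if p > 0.0)

prior = [0.5, 0.35, 0.15]
# Degenerate model (assumed): one bias feature, weights equal to log prior.
theta = [[math.log(p)] for p in prior]
unlabeled = [[1.0]] * 5

for t in (1.0, 0.1):
    preds = [predict(theta, x, t) for x in unlabeled]
    print(t, [round(p, 3) for p in preds[0]], kl_to_expectation(prior, preds))
```

To lower the penalty at the smaller temperature, the optimizer has to spread the winning label across instances in roughly the prior proportions, which approximates matching winning-class counts while remaining differentiable.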


4.  Experimental  Results 

We evaluate on five different data sets, and compare against seven
different methods (both supervised and semi-supervised). We experiment
with varied amounts of data, from one instance per class up to thousands
of instances. We also examine the effect of noise on the label priors
and present results which support the robustness of the method with
respect to varied $\lambda$ and temperature.

Name       # Points   # Features      # Classes
SRAA       40k        77,494          4
POS        40k        11,520          44
SecStr     83k        314 (45,436)    2
BIOII      200k       54,958          3
CoNLL03    200k       114,264         9

Table 1. The data sets are complex: they have dramatic class skews,
highly inter-dependent features, and large amounts of data. The SecStr
data set has 315 atomic features, and 45k features when pairwise feature
conjunctions are used.

4.1.  Experimental  Set-up 

Text classification has been a major target of semi-supervised
approaches (Nigam et al., 2006), and we evaluate on the simulated/real
auto/aviation (SRAA) task. We examine three especially difficult natural
language processing tasks: the CoNLL03 named-entity recognition task
(CoNLL03), part-of-speech tagging of the Wall Street Journal (POS), and
the 2006 BioCreative II evaluation (BIOII), using a sliding window
classifier. Finally, we examine a protein secondary structure prediction
task (SecStr), as extensively evaluated in Chapelle et al. (2006). Table
1 shows characteristics of the various data sets. The tasks are very
large in scale, with up to hundreds of thousands of instances and
features. They have complex characteristics such as heavily
inter-dependent features and highly skewed class distributions.

Across all of the experiments we compare with supervised naive Bayes and
maximum entropy models, and semi-supervised naive Bayes trained with EM
and maximum entropy models trained with entropy regularization. For the
tasks where there may be more features per instance than others, we used
document length normalization for the naive Bayes approaches, which we
have found to sometimes significantly improve accuracy. On the secondary
structure prediction we additionally compare with a supervised SVM using
a radial-basis function (RBF) kernel, a Cluster Kernel (Weston et al.,
2006) and a graph-based method, the Quadratic Cost Criterion with Class
Mean Normalization (Bengio et al., 2006) trained using various data
sub-sampling schemes (Delalleau et al., 2006): a random sampler and two
smarter variations.

Figure 1. BIOII: Label regularization (XR) outperforms all other
methods. The x-axis represents increasing numbers of labeled data
instances. The y-axis is the F-measure micro average across all classes.

Figure 2. CoNLL03: Label regularization (XR) outperforms all other
methods. The x-axis represents increasing numbers of labeled instances
per class, and the y-axis is accuracy.

Figure 3. POS: Label regularization (XR) outperforms all other methods,
though performance improvements over supervised maximum entropy methods
appear to level off at 1300 labeled instances.

Figure 4. SRAA: Label regularization (XR) outperforms its supervised
maximum entropy counterpart and entropy regularization and is the winner
at one labeled instance per class. After that, naive Bayes EM is the
clear winner.


For CoNLL03, POS, BIOII, and SRAA, we run ten trials, splitting the data
randomly into two sections, training and test. From the training set, we
randomly choose some instances to be labeled and hide the labels of the
rest. We then report results on the test data (in what is commonly
called inductive learning). For SecStr we use the labeled/unlabeled
splits provided by Chapelle et al. (2006) and evaluate on the hidden
training data (in what is commonly called transductive learning). In
order to provide a somewhat more fair comparison with the RBF kernels
used by the other methods on this task, the feature set used by the
maximum entropy model and naive Bayes models is augmented by pairwise
feature conjunctions, corresponding to a quadratic kernel.

For the maximum entropy model trained with entropy regularization, after
some experimentation, we weighted its contribution to the objective
function with $\lambda$ = # labeled data points / # unlabeled data
points.

For  the  experiments,  we  use  the  true  label  priors 
estimated  from  data,  corresponding  to  a  use-case 
where  a  user  gives  this  knowledge  to  the  system 
during  training.  Section  4.3  presents  experiments 
showing  robustness  to  noisy  label  priors.  Across  the 
experiments,  we  observed  that  label  regularization 
trains  in  time  linear  in  the  amount  of  unlabeled  data. 


4.2.  Learning  Curves 

Figures 1, 2, 3, and 4 show classifier performance as greater amounts of
labeled data are added. In POS, BIOII, and CoNLL03, label regularization
yields significant benefits over the alternative approaches for all
amounts of training data. On SRAA, label regularization also shows a
benefit over the fully supervised maximum entropy model but its accuracy
is not as high as that obtained by the EM-trained naive Bayes learner.2
At one instance per class, label regularization is unbeaten and yields
improvement when compared to all other approaches considered. Across the
experiments, as the tasks become more complicated, with larger feature
sets and more unlabeled data, the label regularizer provides
increasingly higher accuracy than EM and entropy regularization.

In SecStr, label regularization outperforms the other
methods at 100 labeled points, and approaches the
cluster kernel method at 1000 points. With only 2 labeled
data points, it outperforms the supervised SVM
and maximum entropy model when they are trained
with 100 labeled points. In these experiments QC is
not run over the complete data (presumably because of

2 Note here that the baseline performance of the maximum
entropy model is much lower than that of the naive Bayes
model, so that label regularization starts off at a considerable
deficit.
# Labeled Instances             2       100      1000

SVM (supervised)                -     55.41     66.29
Cluster Kernel                  -     57.05     65.97
QC randsub (CMN)                -     57.68     59.16
QC smartonly (CMN)              -     57.86     59.29
QC smartsub (CMN)               -     57.74     59.16
Naive Bayes (supervised)    52.42     57.12     64.47
Naive Bayes EM              50.79     57.34     57.60
MaxEnt (supervised)         52.42     56.74     65.43
MaxEnt + Ent. Min.          48.56     54.45     58.28
MaxEnt + XR                 57.08     58.51     65.44

Table  2.  Label  regularization  outperforms  other  semi- 
supervised  learning  methods  at  100  labeled  data  points. 
At  one  instance  per  class,  its  performance  is  better  than 
the  supervised  SVM  and  maximum  entropy  model  at  100. 

scalability problems), but operates on a subset, either
selected randomly (randsub) or in a smarter fashion
(smartonly and smartsub), while the label regularization
method uses the complete data. As in the other
experiments, label regularization only helps accuracy,
while for many of the other methods (EM, entropy
regularization, cluster kernels) unlabeled data degrades
performance.

We have tried additional experiments combining label
regularization and entropy regularization. In most
cases, the combination does not improve on label
regularization alone, and it sometimes decreases
accuracy. The two exceptions are the
SRAA and SecStr data sets. Notably, on SecStr,
combined entropy regularization and label
regularization yields a performance of 66.30, matching
the performance of the supervised radial-basis SVM
and beating all other semi-supervised methods.


4.3.  Noisy  Priors 

The previous section assumes that the system has accurate
knowledge of the prior distribution over the
labels. In this section, we perform a sensitivity analysis
by gradually smoothing the class distribution until
it reaches a uniform distribution. We add noisy counts
v to the true counts c(y):


$$P(y) = \frac{c(y) + v}{\sum_{y'} \left( c(y') + v \right)}$$


As more noise is added, the prior distribution converges
to uniform.
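
The smoothing above can be sketched in a few lines. The label names and counts below are made up for illustration (chosen so that the majority class starts at 84%); they are not the CoNLL03 counts, and the function name is hypothetical.

```python
def smooth_prior(counts, v):
    """Add v noisy counts to every label's true count and renormalize."""
    total = sum(c + v for c in counts.values())
    return {y: (c + v) / total for y, c in counts.items()}


# Hypothetical true counts c(y) for three labels.
counts = {"O": 840, "PER": 100, "LOC": 60}

exact = smooth_prior(counts, 0)        # v = 0 recovers the empirical prior
noisy = smooth_prior(counts, 100)      # moderate smoothing
flat = smooth_prior(counts, 10 ** 9)   # very large v is close to uniform
```

Because v is added to every label before normalizing, the result is always a valid distribution, and the majority class probability moves monotonically from its empirical value toward 1/K for K labels.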

Figure 5 demonstrates the effect of increasing noise in
the system. At v = 1,000, the majority class probability
drops from 84% to 80% and there is almost no


Figure 5. CoNLL03: The x-axis (added counts) represents an increasing
amount of noise towards a uniform distribution. On this
data set, the majority class is 84% of the instances, and
so the uniform distribution is an extremely poor approximation.
Performance suffers little when the majority class
prior is erroneously given as 61% (v = 10,000).


Figure 6. CoNLL03: For a wide range of λ and temperature
the performance is similar and surpasses the purely
supervised performance.

loss of performance. At v = 10,000, the
majority class probability drops to 61% and there is
only a slight loss of performance. At v = 1e07 the
majority class probability has dropped to 11%, a virtually
uniform distribution, and performance has leveled
off. These results are encouraging, as they suggest
that relatively large changes (of 20% absolute, 27% relative)
can be tolerated without major losses in accuracy.
Even when the human has no domain knowledge
to contribute, label distribution estimates of sufficient
accuracy should be obtainable from a reasonably small
number of labeled examples.

4.4.  Robustness 

Along with robustness in the face of noise in the
estimated label priors, the model is robust to changes
in λ and temperature. As can be seen in Figure 6, λ
and temperature have a wide plateau over which
performance is stable. At some extreme values of λ
and temperature, the performance degrades, and can
drop below supervised performance. This trend was
observed for 500 labeled examples (shown in the figure),
as well as in cases where there is as little as one labeled
example, for a number of the data sets. For other
semi-supervised techniques such as entropy regularization,
extensive tuning is required for each individual
data set and each labeled/unlabeled data set size in
order to improve upon supervised-only performance
(Jiao et al., 2006).
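
To make the roles of λ and temperature concrete, the following is a minimal sketch of a label regularization term. The exact form is not restated in this section, so the form used here is an assumption: a KL divergence from the given label prior to the model's average prediction on the unlabeled data, with each prediction first sharpened by raising it to the power 1/T and renormalizing. All function and argument names are hypothetical.

```python
import math


def sharpen(probs, temperature):
    """Raise probabilities to the power 1/T and renormalize (assumed form)."""
    powered = [p ** (1.0 / temperature) for p in probs]
    z = sum(powered)
    return [p / z for p in powered]


def label_regularizer(prior, unlabeled_predictions, lam, temperature):
    """lam * KL(prior || mean sharpened prediction on unlabeled data).

    Assumes every prediction is strictly positive, as for a softmax output.
    """
    n = len(unlabeled_predictions)
    sharpened = [sharpen(p, temperature) for p in unlabeled_predictions]
    mean_pred = [sum(p[j] for p in sharpened) / n for j in range(len(prior))]
    kl = sum(q * math.log(q / m) for q, m in zip(prior, mean_pred) if q > 0.0)
    return lam * kl


predictions = [[0.9, 0.1], [0.7, 0.3]]
matched = label_regularizer([0.8, 0.2], predictions, lam=1.0, temperature=1.0)
mismatched = label_regularizer([0.5, 0.5], predictions, lam=1.0, temperature=1.0)
```

In this sketch the penalty is zero when the average prediction matches the prior and grows as the two diverge; λ scales the penalty against the labeled-data likelihood, and a temperature below 1 pushes individual predictions toward more confident values before they are averaged. The term needs a single pass over the unlabeled data.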
5.  Conclusion

This paper has presented expectation regularization,
a new method for semi-supervised learning. This
method penalizes models by the divergence between the
model’s expectations over the unlabeled data and
conditional probabilities, which can be estimated from
labeled data or given as prior knowledge. An important
special case, label regularization, is explored
empirically, and we find that it provides accuracy
improvements over entropy regularization, naive Bayes
EM, the Quadratic Cost Criterion (a representative
graph-based method), and a cluster kernel SVM. Our hope
is that the simplicity, robustness, and scalability of
this method will enable semi-supervised learning to be
more widely deployed.

In future work we will experiment with more general
cases of expectation regularization, in which the human
provides expectations on feature/label pairs. We
will also ultimately apply these methods to structured
models, such as conditional random fields, which, as
exponential family models, are also a natural fit for
XR, and in which the XR gradient can still be efficiently
calculated by dynamic programming.